Abstract: In this paper, we study the model selection property of the Elastic net.

In  the  classical  settings  when  p  (the  number  of  predictors)  and  q  (the  number  of 
predictors  with  non-zero  coefficients  in  the  true  linear  model)  are  fixed,  Yuan  and 
Lin  (2007)  give  a  necessary  and  sufficient  condition  for  the  Elastic  net  to  consistently 
select  the  true  model,  which  is  called  the  Elastic  Irrepresentable  Condition  (EIC) 
in this paper. Here we study the general case when p, q and n all go to infinity. For
general scalings of p, q and n, when Gaussian noise is assumed, sufficient conditions
on p, q and n are given in this paper such that EIC guarantees the Elastic net's
model selection consistency. We show that to make these conditions hold, n should
grow at a rate faster than q log(p - q). For the classical case, when p and q are fixed,
we  also  study  the  relationship  between  EIC  and  the  Irrepresentable  Condition  (IC) 
which  is  necessary  and  sufficient  for  the  Lasso  to  select  the  true  model.  Through 
theoretical  results  and  simulation  studies,  we  provide  insights  into  when  and  why 
EIC  is  weaker  than  IC  and  when  the  Elastic  net  can  consistently  select  the  true 
model even when the Lasso cannot.

Key words and phrases: Lasso; Elastic net; Model selection consistency;
Irrepresentable Condition; Elastic Irrepresentable Condition.

1.  Introduction 

Regularization  has  been  a  popular  technique  for  model  fitting  in  statistical 
learning  when  the  number  of  predictors  p  is  large  compared  with  the  number  of 
observations n. Regularization methods have been shown to have a better accuracy
of prediction on future data (Tikhonov, 1943; Hoerl and Kennard, 1970).
The Lasso (Tibshirani, 1996), which regularizes with an $L_1$ penalty, can also
generate sparse models, which are more interpretable. The Lasso provides a
computationally feasible way for model selection (Osborne et al., 2000; Efron et
al., 2004; Rosset, 2004; Zhao and Yu, 2007). But the Lasso does not perform well
when the predictors are highly correlated or the number of predictors is much
greater than the number of observations. Zou and Hastie (2005) proposed the
Elastic net, which also has the property of sparsity, to solve the above problems.
Zou and Hastie (2005) state that the Elastic net regularization "is like a
stretchable fishing net that retains all the big fish" and that "Simulation studies
and real data examples show that the Elastic net often outperforms the Lasso in
terms of prediction accuracy".

In  this  paper,  we  intend  to  understand  the  model  selection  performance  of 
the  Elastic  net,  relative  to  the  Lasso.  We  obtain  theoretical  results  showing  that 
the  Elastic  net  can  select  the  true  model  consistently  when  the  sparsity  measure, 
the  total  number  of  predictors,  and  the  sample  size  all  go  to  infinity.  We  use 
both  theoretical  results  and  simulation  studies  to  shed  light  on  when  and  why 
the  Elastic  net  can  outperform  the  Lasso  for  model  selection. 

Assume our data consists of a design matrix $X \in \mathbb{R}^{n \times p}$ and the response
vector $Y \in \mathbb{R}^{n}$. They follow a linear regression model

$$Y = X\beta + \epsilon, \tag{1.1}$$

where $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ is a vector of i.i.d. additive Gaussian noise with mean
0 and variance $\sigma^2$. Throughout this paper, the design matrix X is treated as
a deterministic (non-random) matrix. For the random case all the conclusions
can be obtained by conditioning on X. $\beta$ is the vector of model coefficients.
The model is assumed to be "sparse", i.e. most of the regression coefficients $\beta$
are exactly zero, corresponding to predictors that are irrelevant to the response.
Without loss of generality, assume the first q elements of the vector $\beta$ are non-zero.
Let $\beta_{(1)} = (\beta_1, \ldots, \beta_q)$ and $\beta_{(2)} = (\beta_{q+1}, \ldots, \beta_p)$; then $\beta_{(1)} \neq 0$ element-wise and
$\beta_{(2)} = 0$.

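Write $X_{(1)}$ and $X_{(2)}$ as the first q and the last p - q columns of the design matrix
X respectively and let $C^{(n)} = \frac{1}{n} X^T X$. For simplicity, $C^{(n)}$ is denoted by C,
which is a function of n. C can be expressed in a block-wise form:

$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix},$$

where $C_{11} = \frac{1}{n} X_{(1)}^T X_{(1)}$, $C_{12} = \frac{1}{n} X_{(1)}^T X_{(2)}$, $C_{21} = \frac{1}{n} X_{(2)}^T X_{(1)}$ and $C_{22} = \frac{1}{n} X_{(2)}^T X_{(2)}$.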

The naive Elastic net estimate $\hat\beta$ is defined as

$$\hat\beta(\text{naive}) = \arg\min_{\beta} \|Y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1, \tag{1.2}$$

where the parameters $\lambda_1$ and $\lambda_2$ control the amount of regularization applied to the
estimate. Setting $\lambda_2 = 0$ reduces the naive Elastic net estimate to the Lasso estimate.

Since the Elastic net estimate $\hat\beta(\text{Elastic net})$ is defined as $(1 + \lambda_2)\hat\beta(\text{naive})$,
it selects the same model as the naive Elastic net estimate. In this paper, we will
call the naive Elastic net estimate ($\hat\beta$) the Elastic net estimate.
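As an illustration of criterion (1.2), the following is a minimal sketch of a cyclic coordinate descent solver for the naive Elastic net. It is not the algorithm used in this paper (none is specified); the function names `soft_threshold` and `naive_elastic_net`, the number of sweeps and the toy data are all assumptions made for the example. Each coordinate update minimises (1.2) in one coefficient with the others held fixed, which gives a soft-thresholding step with threshold $\lambda_1/2$ and denominator $X_j^T X_j + \lambda_2$.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def naive_elastic_net(X, Y, lam1, lam2, n_sweeps=500):
    """Minimise ||Y - X b||_2^2 + lam2 ||b||_2^2 + lam1 ||b||_1 (criterion (1.2))
    by cyclic coordinate descent. Illustrative sketch, not the paper's code."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)          # X_j^T X_j for every column
    resid = Y - X @ beta                   # current residual Y - X beta
    for _ in range(n_sweeps):
        for j in range(p):
            # X_j^T (residual with coordinate j's contribution added back)
            rho_j = X[:, j] @ resid + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho_j, lam1 / 2.0) / (col_sq[j] + lam2)
            resid += X[:, j] * (beta[j] - new_bj)   # keep residual in sync
            beta[j] = new_bj
    return beta

# Toy usage (hypothetical data): two relevant and three irrelevant predictors.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 5))
beta_demo = np.array([2.0, -1.5, 0.0, 0.0, 0.0])
Y_demo = X_demo @ beta_demo + 0.1 * rng.standard_normal(200)
print(np.sign(naive_elastic_net(X_demo, Y_demo, lam1=20.0, lam2=1.0)))
```

Comparing the printed sign pattern with `np.sign(beta_demo)` is exactly the sign-recovery question studied in the rest of the paper.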

Recent works (Zhao and Yu, 2006; Zou, 2006; Meinshausen and Yu, 2007;
Yuan and Lin, 2007) have studied the model selection consistency
of the Lasso. It has been shown that in the classical case when p and q are
fixed, a simple condition called the Irrepresentable Condition (IC) on the generating
covariance matrices is necessary and sufficient for the Lasso's model selection
consistency. IC is defined in Zhao and Yu (2006) as:

Irrepresentable Condition (IC). There exists a positive constant $\eta > 0$ such that

$$\left\| C_{21} C_{11}^{-1}\, \mathrm{sign}(\beta_{(1)}) \right\|_\infty \le 1 - \eta, \tag{1.3}$$

where the inequality holds element-wise.

More precise results for the $p \gg n$ case are in Wainwright (2006), which
was the first to give conditions for the Lasso's model selection consistency in the
case of general scalings of p, q and n. Yuan and Lin (2007) concentrate mainly
on the non-negative garrote, but also give a necessary and sufficient condition for the
Elastic net to select the true model in the classical setting when p and q are
fixed. EIC is defined as:

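Elastic Irrepresentable Condition (EIC). There exist $\lambda_1$, $\lambda_2$ and a positive
constant $\eta > 0$ such that

$$\left\| C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \left( \mathrm{sign}(\beta_{(1)}) + \frac{2\lambda_2}{\lambda_1} \beta_{(1)} \right) \right\|_\infty \le 1 - \eta, \tag{1.4}$$

where the inequality holds element-wise.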

EIC is exactly IC when $\lambda_2 = 0$ and $C_{11}$ is invertible. EIC does not
need $C_{11}$ to be invertible. If $\lambda_2$ is preselected and fixed, when $\lambda_1$ goes to $\infty$, the
Elastic Irrepresentable Condition reverts to the Irrepresentable Condition.
Generally speaking, if the Irrepresentable Condition holds, then there exist some
$\lambda_1 > 0$ and $\lambda_2 > 0$ such that the corresponding Elastic Irrepresentable Condition
holds. The relationship between EIC and IC will be further studied in Section 3.
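To make the two conditions concrete, here is a small sketch that evaluates the left-hand sides of (1.3) and (1.4) for a given covariance matrix. The helper names `ic_value` and `eic_value` and the 4-predictor covariance matrix below are hypothetical; they are chosen only so that IC fails while EIC holds for one particular pair $(\lambda_1, \lambda_2)$.

```python
import numpy as np

def ic_value(C, q, beta1):
    """Sup-norm of C21 C11^{-1} sign(beta_(1)); IC (1.3) needs it to be <= 1 - eta."""
    C11, C21 = C[:q, :q], C[q:, :q]
    return float(np.max(np.abs(C21 @ np.linalg.solve(C11, np.sign(beta1)))))

def eic_value(C, q, beta1, lam1, lam2, n):
    """Sup-norm on the left-hand side of EIC (1.4) for given lam1, lam2 and n."""
    C11, C21 = C[:q, :q], C[q:, :q]
    v = np.sign(beta1) + (2.0 * lam2 / lam1) * beta1
    shrunk = C11 + (lam2 / n) * np.eye(q)
    return float(np.max(np.abs(C21 @ np.linalg.solve(shrunk, v))))

# Hypothetical example: q = 3 relevant predictors, one irrelevant predictor.
C = np.array([[1.00, 0.00, 0.00, 0.65],
              [0.00, 1.00, 0.00, 0.65],
              [0.00, 0.00, 1.00, 0.20],
              [0.65, 0.65, 0.20, 1.00]])
beta1 = np.array([0.1, 0.1, -2.0])

print("IC  left-hand side:", ic_value(C, 3, beta1))
print("EIC left-hand side:", eic_value(C, 3, beta1, lam1=0.5, lam2=1.0, n=1000))
```

In this made-up example the IC quantity exceeds 1, while the extra term $\frac{2\lambda_2}{\lambda_1}\beta_{(1)}$ pulls the EIC quantity well below 1, which is the mechanism discussed in Section 3.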

In this paper, we analyze the model selection consistency of the Elastic net for
general scalings of p, q and n. The fixed p and q case is a special case. For
the classical settings, we do more analysis than that in Yuan and Lin (2007).
Through special models and simulations, we study the relationship between EIC
and IC; we show that EIC is weaker than IC and that the Elastic net can select
the true model even when the Lasso cannot. For the general case, we give
sufficient  conditions  on  the  relationship  of  p,  q  and  n  such  that  EIC  guarantees 
the  Elastic  net’s  model  selection  consistency. 

The  rest  of  the  paper  is  organized  as  follows.  In  Section  2,  we  give  our  main 
results.  For  the  general  scalings  of  p,  q  and  n,  conditions  on  the  relationship 
between p, q and n are given such that EIC is sufficient for the Elastic net
to  select  the  true  model.  In  Section  3,  we  compare  the  Elastic  Irrepresentable 
Condition  with  the  Irrepresentable  Condition.  Simulation  studies  are  shown  in 
Section  4.  In  Section  5,  we  conclude  and  propose  the  future  directions  for  this 
research.  The  longer  proofs  can  be  found  in  the  appendix. 

2.  Model  Selection  Consistency 

We follow the notations and definitions of sign consistency in Zhao
and Yu (2006) and Wainwright (2006). Define $\hat\beta =_s \beta$ if the vector $\hat\beta$ and the true
parameter $\beta$ have the same sign element-wise.

Definition 1. Property $\mathcal{R}(X, \beta, \epsilon, \lambda_1, \lambda_2)$: There exists an optimal solution
$\hat\beta(\lambda_1, \lambda_2)$ for model (1.2) with the property $\hat\beta =_s \beta$.

Definition 2. The Elastic net estimate is Sign Consistent if there exist $\lambda_1$, $\lambda_2$
such that

$$\lim_{n \to \infty} P\left( \hat\beta(\lambda_1, \lambda_2) =_s \beta \right) = 1.$$

Note that the Elastic net estimate $\hat\beta(\lambda_1, \lambda_2)$ is sign consistent if and only if
$P[\mathcal{R}(X, \beta, \epsilon, \lambda_1, \lambda_2)] \to 1$ as $n \to \infty$.

When  p  and  q  are  fixed,  Yuan  and  Lin  (2007)  have  shown  that  EIC  is  a 
necessary  and  sufficient  condition  for  the  Elastic  net  to  consistently  select  the  true 
model.  We  show  that  when  p,  q  and  n  all  go  to  infinity,  under  some  conditions 
on the relationship between p, q and n, EIC also guarantees that the Elastic net
consistently  selects  the  true  model. 

We first state necessary and sufficient conditions for property $\mathcal{R}(X, \beta, \epsilon, \lambda_1, \lambda_2)$
to hold in Lemma 1, which is a consequence of the KKT (Karush-Kuhn-Tucker)
conditions.

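Lemma 1. For any given $\lambda_1 > 0$, $\lambda_2 > 0$ and noise vector $\epsilon \in \mathbb{R}^n$, property
$\mathcal{R}(X, \beta, \epsilon, \lambda_1, \lambda_2)$ holds if and only if

$$\left| 2 X_{(2)}^T X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( X_{(1)}^T \epsilon - \frac{\lambda_1}{2} \mathrm{sign}(\beta_{(1)}) - \lambda_2 \beta_{(1)} \right) - 2 X_{(2)}^T \epsilon \right| \le \lambda_1, \tag{2.1}$$

$$\left| \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( X_{(1)}^T X_{(1)} \beta_{(1)} + X_{(1)}^T \epsilon - \frac{\lambda_1}{2} \mathrm{sign}(\beta_{(1)}) \right) \right| > 0. \tag{2.2}$$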

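For shorthand, define $b := \mathrm{sign}(\beta_{(1)})$ and denote by $e_i$ the vector with 1 in
the i-th position and zeroes elsewhere. For each index $i \in S = \{1, 2, \ldots, q\}$ and
$j \in S^c = \{q + 1, \ldots, p\}$, define the following random variables:

$$U_i := e_i^T \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( X_{(1)}^T \epsilon - \frac{\lambda_1}{2} b \right), \tag{2.3}$$

$$V_j := 2 X_j^T \left\{ X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( \frac{\lambda_1}{2} b + \lambda_2 \beta_{(1)} \right) - \left[ X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} X_{(1)}^T - I \right] \epsilon \right\}. \tag{2.4}$$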


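These random variables will play an important role in our analysis. In
particular, condition (2.1) holds if and only if the event

$$\mathcal{M}(V) := \left\{ \max_{j \in S^c} |V_j| \le \lambda_1 \right\} \tag{2.5}$$

holds. On the other hand, if we define $\rho := \min_{i \in S} \left| \left[ \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} X_{(1)}^T X_{(1)} \beta_{(1)} \right]_i \right|$,
then the event

$$\mathcal{M}(U) := \left\{ \max_{i \in S} |U_i| < \rho \right\} \tag{2.6}$$

is sufficient to guarantee that condition (2.2) holds.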
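The conditions of Lemma 1 can be checked numerically. The sketch below assumes the closed form $\hat\beta_{(1)} = (X_{(1)}^T X_{(1)} + \lambda_2 I)^{-1}(X_{(1)}^T X_{(1)}\beta_{(1)} + X_{(1)}^T\epsilon - \frac{\lambda_1}{2} b)$ obtained from the KKT conditions in the appendix; the function name `lemma1_check` and the tiny orthogonal design are hypothetical. Besides condition (2.1), the sketch reports explicitly whether the candidate active block has the same signs as $\beta_{(1)}$, which is what sign recovery ultimately requires.

```python
import numpy as np

def lemma1_check(X, beta, eps, lam1, lam2, q):
    """Evaluate condition (2.1) and the sign pattern of the candidate active block.
    Illustrative sketch; the first q columns of X are the relevant predictors."""
    X1, X2 = X[:, :q], X[:, q:]
    b = np.sign(beta[:q])
    A_inv = np.linalg.inv(X1.T @ X1 + lam2 * np.eye(q))
    # Candidate estimate on the true support (closed form from the KKT conditions)
    beta1_hat = A_inv @ (X1.T @ X1 @ beta[:q] + X1.T @ eps - 0.5 * lam1 * b)
    # Left-hand side of (2.1), one entry per irrelevant predictor
    lhs21 = (2 * X2.T @ X1 @ A_inv @ (X1.T @ eps - 0.5 * lam1 * b - lam2 * beta[:q])
             - 2 * X2.T @ eps)
    cond21 = bool(np.all(np.abs(lhs21) <= lam1))
    signs_match = bool(np.all(np.sign(beta1_hat) == b))
    return cond21, signs_match, beta1_hat

# Hypothetical toy design: n = p = 4, orthogonal columns with squared norm n.
X_toy = 2.0 * np.eye(4)
beta_toy = np.array([3.0, -2.0, 0.0, 0.0])
eps_toy = np.array([0.1, -0.2, 0.3, -0.1])
print(lemma1_check(X_toy, beta_toy, eps_toy, lam1=2.0, lam2=1.0, q=2))
```

When both flags are true, padding `beta1_hat` with zeros gives a vector that satisfies the KKT conditions of (1.2), so it is an Elastic net solution with the correct sign pattern.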

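In the zero-noise setting ($\epsilon = 0$), the conditions in Lemma 1 reduce to

$$\left| X_{(2)}^T X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( \mathrm{sign}(\beta_{(1)}) + \frac{2\lambda_2}{\lambda_1} \beta_{(1)} \right) \right| \le 1, \tag{2.7}$$

$$\left| \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( X_{(1)}^T X_{(1)} \beta_{(1)} - \frac{\lambda_1}{2} \mathrm{sign}(\beta_{(1)}) \right) \right| > 0. \tag{2.8}$$

When noise is present, under some conditions on the relationship between the
scalings of p, q and n, the Elastic Irrepresentable Condition is still sufficient for
property $\mathcal{R}(X, \beta, \epsilon, \lambda_1, \lambda_2)$ to hold with probability tending to 1 as $n \to \infty$:

Theorem 1. Suppose that $Y = X\beta + \epsilon$, where each column of X is normalized
to $\ell_2$-norm $\sqrt{n}$ and $\epsilon \sim N(0, \sigma^2 I)$. Assume EIC (1.4) holds. Define
$\rho := \min_i \left| \left[ \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} C_{11} \beta_{(1)} \right]_i \right|$ and $C_{\min} := \Lambda_{\min}(C_{11}) + \frac{\lambda_2}{n}$, where $\Lambda_{\min}(\cdot)$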

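denotes the minimal eigenvalue. If $\lambda_1$ is chosen such that

(a) $\frac{\lambda_1^2}{n \log(p - q)} \to \infty$,

(b) $\frac{1}{\rho} \left\{ \sqrt{\frac{\sigma^2 \log q}{n\, C_{\min}}} + \frac{\lambda_1}{n} \left\| \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \mathrm{sign}(\beta_{(1)}) \right\|_\infty \right\} \to 0$,

then $P[\mathcal{R}(X, \beta, \epsilon, \lambda_1, \lambda_2)] \to 1$ as $n \to \infty$.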


A  proof  of  Theorem  1  can  be  found  in  the  appendix. 

Theorem 1 gives a general result for general scalings of p, q and n. In the
classical setting where p and q are fixed, if $C_{11}$ converges to a non-negative definite
matrix $C_0$, $\rho$ will converge to a non-negative number $\rho_0$. Suppose $\rho_0 > 0$; then
condition (a) is equivalent to $\lambda_1/\sqrt{n} \to \infty$ and condition (b) is equivalent to
$\lambda_1/n \to 0$, if $C_{\min} \ge \alpha$ for some $\alpha > 0$.

Corollary 1. When p and q are fixed, suppose that $C_{11}$ converges to $C_0$, $\rho_0 > 0$
and $C_{\min} \ge \alpha$ for some $\alpha > 0$. Then EIC implies $P[\mathcal{R}(X, \beta, \epsilon, \lambda_1, \lambda_2)] \to 1$ as
$n \to \infty$, if

(a) $\lambda_1/\sqrt{n} \to \infty$,

(b) $\lambda_1/n \to 0$.

Note that $\lambda_1 = \sqrt{n} \log n$ is a suitable choice. A similar conclusion is also
reached in Meinshausen and Bühlmann (2006), Zhao and Yu (2006), Zou (2006)
and Wainwright (2007) for the Lasso to select the true model. Regarding
constraints on $\lambda_2$: when $C_{11}$ is invertible and $\Lambda_{\min}(C_{11}) \ge \alpha$ for some $\alpha > 0$, any
$\lambda_2 \ge 0$ can be chosen as long as it satisfies EIC; when $C_{11}$ is not invertible,
$\lambda_2 = \gamma n$ can be chosen, for any $\gamma > 0$ which satisfies EIC.
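A quick numerical illustration of the two rate conditions in Corollary 1, assuming the suggested choice is read as $\lambda_1 = \sqrt{n}\log n$: the ratio $\lambda_1/\sqrt{n} = \log n$ keeps growing, while $\lambda_1/n = \log n/\sqrt{n}$ shrinks. The grid of sample sizes is arbitrary.

```python
import numpy as np

ns = np.array([1e2, 1e3, 1e4, 1e5, 1e6])   # arbitrary grid of sample sizes
lam1 = np.sqrt(ns) * np.log(ns)            # assumed choice lam1 = sqrt(n) * log(n)
ratio_a = lam1 / np.sqrt(ns)               # condition (a): should diverge
ratio_b = lam1 / ns                        # condition (b): should vanish

for n, a, b in zip(ns, ratio_a, ratio_b):
    print(f"n={n:>9.0f}  lam1/sqrt(n)={a:6.2f}  lam1/n={b:.4f}")
```

Any other sequence squeezed strictly between the orders $\sqrt{n}$ and $n$ would serve equally well for the fixed p and q case.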

When all three parameters (n, p, q) grow to infinity, suppose that $C_{\min} \ge \alpha$
for some $\alpha > 0$ and $\rho \ge \rho_0$ for some $\rho_0 > 0$. Then we have

Corollary 2. EIC implies that the Elastic net has sign consistency if

(a) $\frac{\lambda_1^2}{n \log(p - q)} \to \infty$,

(b) $\frac{\log q}{n} \to 0$,

(c) $\frac{\lambda_1 \sqrt{q}}{n} \to 0$.


Proof. Note that $\left\| \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} b \right\|_\infty \le C_{\min}^{-1} \|b\|_2 = C_{\min}^{-1} \sqrt{q}$. So, conditions
(b) and (c) in Corollary 2 guarantee that condition (b) in Theorem 1 holds. □


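The conditions $\frac{\lambda_1^2}{n \log(p - q)} \left( = \left( \frac{\lambda_1 \sqrt{q}}{n} \right)^2 \times \frac{n}{q \log(p - q)} \right) \to +\infty$ and $\frac{\lambda_1 \sqrt{q}}{n} \to 0$ imply
that the number of observations n must grow at a rate faster than $q \log(p - q)$.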


3.  Relationship  between  EIC  and  IC 

As  shown  in  Zou  and  Hastie  (2005),  the  Elastic  net  can  select  the  “important” 
variables  for  prediction  and  it  often  outperforms  the  Lasso  in  terms  of  prediction 
accuracy.  Under  some  conditions,  we  have  shown  that  in  theory  it  consistently 
selects  the  relevant  predictors.  In  this  section,  we  will  show  theoretically  that  the 
Elastic  net  often  outperforms  the  Lasso  in  terms  of  model  selection  consistency. 

Proposition 1. The Irrepresentable Condition implies the Elastic Irrepresentable
Condition, but the Elastic Irrepresentable Condition does not imply the
Irrepresentable Condition.


This result is trivial, since $\lambda_2 = 0$ or small $\lambda_2 > 0$ leads EIC back to IC.

Proposition  1  shows  that  when  the  Lasso  can  select  the  true  model,  the 
Elastic  net  also  can  select  the  true  model;  the  Elastic  net  often  outperforms  the 
Lasso  in  terms  of  model  selection  consistency.  We  have  to  point  out  that  it  may 
happen  that  in  some  situations  neither  the  Lasso  nor  the  Elastic  net  can  select 
the  true  model,  which  can  be  seen  by  simulations  in  Section  4. 

An  interesting  question  is  under  what  conditions,  the  Elastic  net  will  do  a 
much  better  job  than  the  Lasso  for  model  selection.  In  other  words,  what  prior 
information  about  the  model  parameters  would  suggest  that  the  Elastic  net  will 
select  the  true  model  while  the  Lasso  does  not?  It  is  hard  to  answer  this  question 
in  general.  But,  in  some  situations,  we  can  provide  some  insight  into  when  the 
EIC will hold while IC does not.

Consider the case p - q = 1, that is, there exists only one irrelevant predictor.
This is the simplest model selection problem. For this kind of problem, we can
give a simple necessary and sufficient condition such that EIC holds.

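Theorem 2. In the case when p - q = 1, EIC holds if and only if

$$C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \mathrm{sign}(\beta_{(1)}) > 1 \quad \text{and} \quad C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \beta_{(1)} < 0, \tag{3.1}$$

or

$$C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \mathrm{sign}(\beta_{(1)}) < -1 \quad \text{and} \quad C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \beta_{(1)} > 0, \tag{3.2}$$

or

$$\left| C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \mathrm{sign}(\beta_{(1)}) \right| < 1 - \eta, \quad \text{for some } 0 < \eta < 1. \tag{3.3}$$

Proof. When p - q = 1, $C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \mathrm{sign}(\beta_{(1)})$ and $C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \beta_{(1)}$ are
both scalars. Immediately, (1.4) is equivalent to conditions (3.1), (3.2) or (3.3)
by choosing a suitable $\lambda_1$. □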
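The case analysis behind Theorem 2 can be sketched in a few lines. Write $a$ for the scalar $C_{21}(C_{11} + \frac{\lambda_2}{n}I)^{-1}\mathrm{sign}(\beta_{(1)})$, $b$ for $C_{21}(C_{11} + \frac{\lambda_2}{n}I)^{-1}\beta_{(1)}$ and $t = 2\lambda_2/\lambda_1 > 0$; EIC then asks for $|a + tb| < 1$. The helper name `eic_ratio_single` is hypothetical, and the particular $t$ it returns is only one convenient choice.

```python
def eic_ratio_single(a, b):
    """For p - q = 1, return some t = 2*lam2/lam1 > 0 with |a + t*b| < 1,
    or None when none of the cases (3.1)-(3.3) applies.
    The boundary |a| = 1 is not covered by Theorem 2 and is treated as failure here."""
    if abs(a) < 1:
        # Case (3.3): a sufficiently small t keeps |a + t*b| below 1.
        return 1.0 if b == 0 else (1 - abs(a)) / (2 * abs(b))
    if (a > 1 and b < 0) or (a < -1 and b > 0):
        # Cases (3.1) and (3.2): this t makes a + t*b exactly zero.
        return -a / b
    return None

# Hypothetical values: the first pair falls under (3.1), the second admits no t.
print(eic_ratio_single(1.1, -0.27))
print(eic_ratio_single(1.1, 0.27))
```

Given a fixed $\lambda_2 > 0$, a returned value $t$ corresponds to $\lambda_1 = 2\lambda_2/t$.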

Choosing the appropriate $\lambda_2$ for Theorem 2 requires difficult manipulations.
Below, we give sufficient conditions to ensure that EIC holds and IC does not for
any fixed value of $\lambda_2$.

Corollary 3. Suppose $C_{11}$ is invertible. In the case when p - q = 1, for any fixed
value of $\lambda_2$, when n is very large, EIC holds while IC does not if

$$C_{21} C_{11}^{-1} \mathrm{sign}(\beta_{(1)}) > 1 \quad \text{and} \quad C_{21} C_{11}^{-1} \beta_{(1)} < 0 \tag{3.4}$$

or

$$C_{21} C_{11}^{-1} \mathrm{sign}(\beta_{(1)}) < -1 \quad \text{and} \quad C_{21} C_{11}^{-1} \beta_{(1)} > 0. \tag{3.5}$$

Proof. When $\lambda_2 = 0$, (3.1) is exactly condition (3.4) and (3.2) is exactly condition
(3.5). Although $\lambda_2$ is not allowed to be 0, a small $\lambda_2$ or a small $\frac{\lambda_2}{n}$ can be chosen such that
condition (3.1) or (3.2) holds, each of which is sufficient for EIC to hold. □

Denote by $\Gamma$ the estimated regression coefficient of the linear model $X_{(2)} = X_{(1)} \Gamma +$
noise. It can be the OLS estimate $C_{11}^{-1} C_{12}$ as in Corollary 3 or the ridge regression
estimate $\left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} C_{12}$ as in Theorem 2. Theorem 2 and its corollary
(Corollary 3) suggest that if the Lasso does not select the true model, it is because
$|\Gamma^T \mathrm{sign}(\beta_{(1)})|$ is too large. But the Elastic net might be able to overcome
this problem by adding another term, a multiple of $\Gamma^T \beta_{(1)}$, to $\Gamma^T \mathrm{sign}(\beta_{(1)})$ such
that the absolute value of the new term $\Gamma^T \mathrm{sign}(\beta_{(1)}) + \alpha \Gamma^T \beta_{(1)}$ is not very large
for some $\alpha > 0$. The small absolute value of the new term implies that the EIC
holds, and therefore the Elastic net can consistently select the true model.

In the situations when p - q ≥ 2, explanations about EIC are complicated.
But the element-wise analogues of conditions (3.1) and (3.2) remain necessary for
EIC to hold. We state this as a corollary of Theorem 2.

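Corollary 4. In the case when p - q > 1, EIC holds only if

$$\left[ C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \beta_{(1)} \right]_i < 0 \quad \text{when} \quad \left[ C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \mathrm{sign}(\beta_{(1)}) \right]_i > 1, \tag{3.6}$$

and

$$\left[ C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \beta_{(1)} \right]_i > 0 \quad \text{when} \quad \left[ C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \mathrm{sign}(\beta_{(1)}) \right]_i < -1, \tag{3.7}$$

where $[\cdot]_i$ denotes the i-th element of a vector.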


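Proof. When condition (3.6) or (3.7) does not hold for some i, then for every $\lambda_1 > 0$

$$\left| \left[ C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \left( \frac{2\lambda_2}{\lambda_1} \beta_{(1)} + \mathrm{sign}(\beta_{(1)}) \right) \right]_i \right| \ge \left| \left[ C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \mathrm{sign}(\beta_{(1)}) \right]_i \right| > 1,$$

which violates EIC. □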


4.  Simulations 

Zou  and  Hastie  (2005)  contain  many  experiments  to  show  that  the  Elastic 
net  performs  much  better  than  the  Lasso,  OLS  and  ridge  regression  in  terms  of 
prediction  accuracy,  but  they  did  not  compare  the  model  selection  performances 
between the Lasso and the Elastic net. Yuan and Lin (2007) also have no example
to show the differences in model selection consistency
between the Lasso and the Elastic net. In this section, some simulations are
provided to show that when the Lasso cannot select the true model, the Elastic
net may still select the true model. When $p \gg n$, especially when q > n, the
Lasso can select at most n variables before the model saturates. So, when q > n,
the Lasso theoretically cannot select all of the true predictors. We will give an
example to show that the Elastic net might be able to solve this kind of problem.

In  the  first  3  examples,  p  and  q  are  small  compared  to  n  =  1000.  These 
examples  can  be  treated  as  fixed  p  and  q  cases.  Because  of  the  large  number 
of  observations,  the  results  are  consistent  and  the  plots  appear  the  same  for 
multiple simulations. From Corollary 1 and Corollary 3, it can be seen that the
choice of $\lambda_2$ is not very important. In these examples, we take $\lambda_2 = 100$. We did
many simulations with different values of $\lambda_2$, and did not see much effect of $\lambda_2$ on the
performance of model selection consistency.

Example 1. The first example has the same settings as Zhao and Yu (2006).
They gave an example with p = 3 to show that when the Irrepresentable
Condition holds there is a consistent Lasso solution and when the Irrepresentable
Condition does not hold, there is no consistent Lasso solution.

$X_1$, $X_2$, $\epsilon$ and $e$ are first generated from the standard normal distribution
with mean 0 and variance 1. $X_3$ is generated to be correlated with $X_1$ and $X_2$
by

$$X_3 = \tfrac{2}{3} X_1 + \tfrac{2}{3} X_2 + \tfrac{1}{3} e,$$

which also has a standard normal distribution. The true linear model is:

$$Y = X_1 \beta_1 + X_2 \beta_2 + \epsilon.$$

Now, consider two settings: (a) $\beta_1 = 2$, $\beta_2 = 3$ and (b) $\beta_1 = -2$, $\beta_2 = 3$. In
both settings, $X_{(1)} = (X_1, X_2)$, $X_{(2)} = X_3$ and it is easy to check that $C_{21} C_{11}^{-1} =
(\tfrac{2}{3}, \tfrac{2}{3})$. So, setting (b) makes the Irrepresentable Condition hold, while setting (a)
does not. The Lasso and the Elastic net are applied to both settings (a) and
(b) respectively and the solution paths are shown in Figure 4.1 and Figure 4.2.
Figures 4.1 and 4.2 show that in setting (a), neither the Lasso nor the Elastic
net can select the true model and in setting (b), both the Lasso and the Elastic
net can select the true model.
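The IC calculation in Example 1 can be reproduced from simulated data. The sketch below assumes the design of Zhao and Yu (2006), $X_3 = \frac{2}{3}X_1 + \frac{2}{3}X_2 + \frac{1}{3}e$, with n = 1000 as in this section; the seed and the helper name `ic_lhs` are arbitrary choices. It evaluates the sample version of $|C_{21} C_{11}^{-1} \mathrm{sign}(\beta_{(1)})|$ for settings (a) and (b), whose population values are 4/3 and 0.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed
n = 1000
X1, X2, e = rng.standard_normal((3, n))
X3 = (2 / 3) * X1 + (2 / 3) * X2 + (1 / 3) * e   # assumed design (Zhao and Yu, 2006)
X = np.column_stack([X1, X2, X3])
C = X.T @ X / n
C11, C21 = C[:2, :2], C[2:, :2]

def ic_lhs(beta1):
    """Sample value of |C21 C11^{-1} sign(beta_(1))| for the single irrelevant predictor."""
    return float(np.max(np.abs(C21 @ np.linalg.solve(C11, np.sign(beta1)))))

ic_a = ic_lhs(np.array([2.0, 3.0]))    # setting (a): population value 4/3, IC fails
ic_b = ic_lhs(np.array([-2.0, 3.0]))   # setting (b): population value 0, IC holds
print(ic_a, ic_b)
```

Only the signs of $\beta_1$ and $\beta_2$ enter the IC quantity, which is why flipping the sign of $\beta_1$ is enough to move between the two regimes.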

Figure 4.1: the Lasso solution paths for setting (a) (panel title: Elastic net; axis label: Standardized Coefficients)

Figure 4.2: Elastic net solution paths for setting (b) (panel titles: LASSO, Elastic net)

Example 2. This example is used to illustrate that when the Irrepresentable
Condition does not hold, the Elastic Irrepresentable Condition may hold and the
Elastic net will select the true model, while the Lasso does not. In this example,
p = 6. $X_1$, $X_2$, $X_3$, $X_4$, $X_5$, $\epsilon$ and $e$ are first generated from the standard normal
distribution. $X_6$, also from the standard normal distribution, is generated to be
correlated with $X_1$, $X_2$, $X_3$, $X_4$ and $X_5$ by

Xq  =  -Xi  +  -X2  +  -X3  +  -X4  +  -X5  +  ?ye, 

where  the  constant  ij  =  -^2-  is  used  to  make  Xq  have  variance  1.  The  regression 
model  is 

$$Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \epsilon.$$

Now  AT)  =  (Xi,X2,X3,X4,Xq),  X(2j  =  Xq  and  it  is  easy  to  check  that 
C21C^1  =  (|,  j,  7j,  ^).  Suppose  that  (3\  <  0,(32  <  (1,  >  0, /?4  >  0  and  Bq  >  0. 

It  is  easy  to  check  that 

|C'2iC'1"11(s^n(/3{i))|  =  ^  >  1, 

so,  the  Irrepresentable  Condition  does  not  hold. 

In the settings above, a sufficient condition can be given such that the Elastic
net selects a consistent model. This condition is a direct consequence of Corollary 3.

In  the  settings  of  Example  2,  the  Elastic  Irrepresentable  Condition  holds  if 

—  (Pi  +  2/^2)  >  4/?3  +  4/^4  +  4/?5  (4-1) 

Now let $\beta_1 = -4$, $\beta_2 = -2$, $\beta_3 = 0.5$, $\beta_4 = 0.6$ and $\beta_5 = 0.7$. It is easy to
check that inequality (4.1) holds. The Lasso is first used to get the solution path
shown in Figure 4.3 (a) and then the Elastic net is used to get the solution path
shown in Figure 4.3 (b). The figure shows that the Lasso does not select the true
model while the Elastic net does.

Example 3. As reported in Zou and Hastie (2005), when predictors are
highly correlated, the Lasso tends to select only one of these highly correlated
predictors. In particular, when there are two predictors which are the same,
theoretically, the Lasso cannot select both of them. In this example, we will show
that the Elastic net can select both of them and can select the true model. By
this example, we also show that when $C_{11}$ is not invertible, we can still consider
the consistency of the Elastic net, while the consideration of consistency of the
Lasso needs the assumption that $C_{11}$ is invertible.

Figure 4.3: Lasso and Elastic net solution paths

$X_1$, $X_2$, $\epsilon$ and $e$ are first generated from a normal distribution with mean 0
and variance 1. Let $X_3 = X_2$. $X_4$ is generated to be correlated with $X_1$, $X_2$ and
$X_3$ by

X4  =  -*1  +  ix2  +  l*3  +  -e, 

which  also  has  a  standard  normal  distribution.  The  true  linear  model  is: 

Y  =  -2X4  +  X2  +  X3  +  e. 

The  Lasso  and  the  Elastic  net  are  applied  separately  and  the  solution  paths 
are  shown  in  Figure  4.4.  This  figure  shows  that  the  Elastic  net  selects  the  true 
model  while  the  Lasso  does  not. 

Figure 4.4: Lasso and Elastic net solution paths; $X_2 = X_3$

Example 4. In this example, we want to illustrate that if $p \gg n$ and EIC
holds, then the conditions in Corollary 2 of Theorem 1 guarantee that the Elastic
net can select the true model. In the p > n case, the Lasso selects at most n
variables before it saturates. So if q > n, the Lasso cannot select the true model.

Set q = 50 and p = 52. From the comments after Corollary 2, n is supposed
to grow at a rate faster than $q \log(p - q)$, which is equal to 50 × log 2 ≈ 35. So
here we choose n = 46, which is less than q. The design matrix X is generated
from the joint standard normal distribution $N(0, I_{p \times p})$. Set $\lambda_2 = 0.01$ and simulate
X such that X satisfies $\left\| C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \mathbf{1} \right\|_\infty < 1$, where $\mathbf{1}$ is a column vector
with all entries being 1. Let $\beta = [\beta_{(1)}^T, \beta_{(2)}^T]^T$, where $\beta_{(1)}$ is a q-vector with
all entries being 1 and $\beta_{(2)}$ is a (p - q)-vector with all entries being 0. Since

$$C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \left( \mathrm{sign}(\beta_{(1)}) + \frac{2\lambda_2}{\lambda_1} \beta_{(1)} \right) = \left( 1 + \frac{2\lambda_2}{\lambda_1} \right) C_{21} \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} \mathbf{1},$$

there exists some $\lambda_1$ such that EIC holds. The true model is: $Y = X\beta + 0.04 \times \epsilon$. Then
the Elastic net is applied. The solution path is shown in Figure 4.5.

After examining the solution on the path, we find that the solution corresponding
to the vertical line in Figure 4.5 recovers exactly the first q non-zero
predictors.  Theoretically,  the  Lasso  can  select  at  most  n  =  46  variables  and  so 
the  Lasso  does  not  perform  well  on  this  data.  After  applying  the  Lasso  on  this 
simulated  data,  we  find  that  it  can  only  select  45  variables  at  most. 

Figure 4.5: Elastic net solution paths for large p, q, small n

5. Conclusion

In this paper, we have discussed the ability of the Elastic net to recover the
sparsity pattern of the regression coefficients $\beta$. EIC is crucial for the Elastic net's
model selection consistency. In the classical case when p and q are fixed, EIC is
necessary and sufficient for the Elastic net to consistently select the true model
(Yuan and Lin, 2007). When p and q both grow as n grows, EIC alone is no longer
sufficient; additional conditions on the relationship among p, q and n are required.
In this paper, for our consistency results, it is required that n grows at a rate
faster than $q \log(p - q)$. When p > n, as in Example 4, the Elastic net performs
better than the Lasso.

We compared the ability of the Elastic net to select the true model with that
of the Lasso. EIC is weaker than IC, so the Elastic net performs at least as well
as the Lasso in terms of model selection consistency. From Example 2, it can
be seen that when the Lasso cannot select the true model, the Elastic net may
select the true model. But we also see that in some situations, neither the Lasso
nor the Elastic net selects the true model (see Example 1). Example 3 is used
to show that when the true predictors are highly correlated, the Lasso does not
select all the highly correlated variables. Yet, the Elastic net can select all of
them.

Finally, we propose future directions for this research. From Theorem 1
and its corollaries, the choice of $\lambda_2$ does not affect the Elastic net's ability to
select the true model when EIC holds. This suggests that, in practice, we can
find a suitable $\lambda_2$ such that the EIC holds before the Elastic net is applied to do
model selection and prediction. In some situations (see Corollary 3), any fixed
$\lambda_2$ satisfies EIC. But in general, how to choose a suitable $\lambda_2$ such that EIC holds
should be studied further.

Appendix: Proofs

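Proof of Lemma 1. By the standard KKT conditions for optimality in a
convex program, the point $\hat\beta$ is optimal if and only if

$$2 X^T X \hat\beta - 2 X^T Y + 2 \lambda_2 \hat\beta + \lambda_1 z = 0. \tag{1}$$

Here

$$z_i = \begin{cases} \mathrm{sign}(\hat\beta_i) & \hat\beta_i \neq 0, \\ \text{any real number in } [-1, 1] & \hat\beta_i = 0. \end{cases}$$

Substituting Y by $X\beta + \epsilon$ yields:

$$2 X^T X (\hat\beta - \beta) - 2 X^T \epsilon + 2 \lambda_2 \hat\beta + \lambda_1 z = 0. \tag{2}$$

Property $\mathcal{R}(X, \beta, \epsilon, \lambda_1, \lambda_2)$ holds if and only if we have

$$\hat\beta_{(2)} = 0, \quad \hat\beta_{(1)} \neq 0, \quad z_{(1)} = \mathrm{sign}(\beta_{(1)}), \quad |z_{(2)}| \le 1.$$

From these conditions and using equation (2), we conclude that property
$\mathcal{R}(X, \beta, \epsilon, \lambda_1, \lambda_2)$ holds if and only if

$$2 X_{(2)}^T X_{(1)} (\hat\beta_{(1)} - \beta_{(1)}) - 2 X_{(2)}^T \epsilon = -\lambda_1 z_{(2)}, \tag{3}$$

$$2 X_{(1)}^T X_{(1)} (\hat\beta_{(1)} - \beta_{(1)}) - 2 X_{(1)}^T \epsilon + 2 \lambda_2 \hat\beta_{(1)} = -\lambda_1 \mathrm{sign}(\beta_{(1)}). \tag{4}$$

By these two equations, we may solve for $\hat\beta_{(1)}$ and $z_{(2)}$ to conclude that

$$-\lambda_1 z_{(2)} = 2 X_{(2)}^T X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( X_{(1)}^T \epsilon - \frac{\lambda_1}{2} \mathrm{sign}(\beta_{(1)}) - \lambda_2 \beta_{(1)} \right) - 2 X_{(2)}^T \epsilon,$$

$$\hat\beta_{(1)} = \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( X_{(1)}^T X_{(1)} \beta_{(1)} + X_{(1)}^T \epsilon - \frac{\lambda_1}{2} \mathrm{sign}(\beta_{(1)}) \right).$$

The conditions $|z_{(2)}| \le 1$ and $\hat\beta_{(1)} \neq 0$ yield conditions (2.1) and (2.2)
respectively. □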

Before proving Theorem 1, we state without proof one well-known comparison
result on Gaussian maxima (see Ledoux and Talagrand, 1991).

Lemma 2. For any Gaussian random vector $(X_1, \ldots, X_n)$, we have

$$E \max_{1 \le i \le n} |X_i| \le 3 \sqrt{\log n} \max_{1 \le i \le n} \sqrt{E X_i^2}. \tag{5}$$
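A Monte Carlo sanity check of the bound in Lemma 2 can be sketched as follows. The lemma holds for any Gaussian vector; for simplicity this sketch uses independent centred coordinates with unequal, made-up standard deviations, and the sample sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
n, n_rep = 1000, 2000                           # vector length, Monte Carlo replications
sd = np.linspace(0.5, 2.0, n)                   # hypothetical standard deviations
draws = rng.standard_normal((n_rep, n)) * sd    # independent centred Gaussians
lhs = np.abs(draws).max(axis=1).mean()          # estimate of E max_i |X_i|
rhs = 3 * np.sqrt(np.log(n)) * sd.max()         # bound (5): 3 sqrt(log n) max_i sqrt(E X_i^2)
print(lhs, rhs)
```

The bound is far from tight in this example; in the proof of Theorem 1 only its order, $\sqrt{\log n}$ times the largest standard deviation, matters.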


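Proof of Theorem 1.

1. Analysis of $\mathcal{M}(V)$

Note that $V_j$ is Gaussian with mean

$$\mu_j = E(V_j) = X_j^T X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( \lambda_1 b + 2 \lambda_2 \beta_{(1)} \right).$$

Recall that the Elastic Irrepresentable Condition is:

$$\left\| X_{(2)}^T X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( \mathrm{sign}(\beta_{(1)}) + \frac{2\lambda_2}{\lambda_1} \beta_{(1)} \right) \right\|_\infty \le 1 - \eta. \tag{6}$$

By condition (6), $|\mu_j| \le (1 - \eta) \lambda_1$.
Define $\tilde V_j := 2 X_j^T \left[ I - X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} X_{(1)}^T \right] \epsilon$; then $V_j = \mu_j + \tilde V_j$. Note that
$\mathcal{M}(V)$ holds if and only if $\frac{\max_{j \in S^c} V_j}{\lambda_1} \le 1$ and $\frac{\min_{j \in S^c} V_j}{\lambda_1} \ge -1$. Since

$$\frac{\max_{j \in S^c} V_j}{\lambda_1} = \frac{\max_{j \in S^c} (\mu_j + \tilde V_j)}{\lambda_1} \le (1 - \eta) + \frac{1}{\lambda_1} \max_j \tilde V_j, \quad \text{and} \tag{7}$$

$$\frac{\min_{j \in S^c} V_j}{\lambda_1} = \frac{\min_{j \in S^c} (\mu_j + \tilde V_j)}{\lambda_1} \ge -(1 - \eta) + \frac{1}{\lambda_1} \min_j \tilde V_j, \tag{8}$$

now we need to show that

$$P\left[ \frac{1}{\lambda_1} \max_{j \in S^c} \tilde V_j > \eta, \ \text{or} \ \frac{1}{\lambda_1} \min_{j \in S^c} \tilde V_j < -\eta \right] \to 0. \tag{9}$$

In fact, it is sufficient to show that $P\left[ \frac{\max_{j \in S^c} |\tilde V_j|}{\lambda_1} \ge \eta \right] \to 0$. By applying
Markov's inequality and the Gaussian comparison result (5), we have

$$P\left[ \frac{\max_{j \in S^c} |\tilde V_j|}{\lambda_1} \ge \eta \right] \le \frac{E[\max_{j \in S^c} |\tilde V_j|]}{\lambda_1 \eta} \le \frac{3 \sqrt{\log(p - q)}}{\lambda_1 \eta} \max_{j \in S^c} \sqrt{E[\tilde V_j^2]}. \tag{10}$$

Straightforward computation yields that

$$\begin{aligned}
\tfrac{1}{4} E[\tilde V_j^2] &= \sigma^2 X_j^T \left[ I - X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} X_{(1)}^T \right]^2 X_j \\
&= \sigma^2 X_j^T \left[ I - 2 X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} X_{(1)}^T \right] X_j \\
&\quad + \sigma^2 X_j^T X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( X_{(1)}^T X_{(1)} \right) \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} X_{(1)}^T X_j \\
&\le \sigma^2 X_j^T \left[ I - 2 X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} X_{(1)}^T \right] X_j \\
&\quad + \sigma^2 X_j^T X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( X_{(1)}^T X_{(1)} \right) \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} X_{(1)}^T X_j \\
&\quad + \sigma^2 X_j^T X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \lambda_2 I \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} X_{(1)}^T X_j \\
&= \sigma^2 X_j^T \left[ I - X_{(1)} \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} X_{(1)}^T \right] X_j \\
&\le \sigma^2 X_j^T X_j = n \sigma^2.
\end{aligned}$$

Putting this into inequality (10), we have

$$P\left[ \frac{\max_{j \in S^c} |\tilde V_j|}{\lambda_1} \ge \eta \right] \le \frac{6 \sigma \sqrt{n \log(p - q)}}{\lambda_1 \eta},$$

so condition (a) in Theorem 1 guarantees that $P\left[ \frac{\max_{j \in S^c} |\tilde V_j|}{\lambda_1} \ge \eta \right] \to 0$, and
hence $P(\mathcal{M}(V)) \to 1$.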
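The chain of inequalities for $E[\tilde V_j^2]$ rests on the fact that the ridge-type matrix $H = X_{(1)}(X_{(1)}^T X_{(1)} + \lambda_2 I)^{-1} X_{(1)}^T$ has eigenvalues in $[0, 1)$, so that $(I - H)^2 \preceq I - H \preceq I$. A small numerical check of this ordering, with made-up dimensions and penalty, might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed
n, q, lam2 = 50, 5, 0.7                    # hypothetical sizes and penalty
X1 = rng.standard_normal((n, q))           # stands in for X_(1)
xj = rng.standard_normal(n)                # stands in for one irrelevant column X_j
H = X1 @ np.linalg.solve(X1.T @ X1 + lam2 * np.eye(q), X1.T)
M = np.eye(n) - H

quad_sq = xj @ M @ M @ xj   # X_j^T (I - H)^2 X_j, i.e. E[V~_j^2] / (4 sigma^2)
quad_1 = xj @ M @ xj        # X_j^T (I - H) X_j
quad_0 = xj @ xj            # X_j^T X_j (equal to n under the column normalisation)
print(quad_sq <= quad_1 <= quad_0)
```

The same ordering holds for any $\lambda_2 \ge 0$, which is why the bound $n\sigma^2$ does not depend on $\lambda_2$.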

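2. Analysis of $\mathcal{M}(U)$

Define $Z_i := e_i^T \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} X_{(1)}^T \epsilon$; then

$$\max_i |U_i| = \max_i \left| Z_i - \frac{1}{2} e_i^T \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \lambda_1 b \right| \le \max_i |Z_i| + \frac{1}{2} \lambda_1 \left\| \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} b \right\|_\infty.$$

Note that $Z_i$ is Gaussian with mean 0 and variance

$$\begin{aligned}
\mathrm{var}(Z_i) &= \sigma^2 e_i^T \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( X_{(1)}^T X_{(1)} \right) \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} e_i \\
&\le \sigma^2 e_i^T \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} e_i \\
&\le \frac{\sigma^2}{n C_{\min}}.
\end{aligned}$$

So by the Gaussian comparison result (5), we have

$$E[\max_i |Z_i|] \le 3 \sqrt{\frac{\sigma^2 \log q}{n C_{\min}}}.$$

Therefore

$$\begin{aligned}
1 - P&\left[ \left| \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} \left( X_{(1)}^T X_{(1)} \beta_{(1)} + X_{(1)}^T \epsilon - \frac{\lambda_1}{2} \mathrm{sign}(\beta_{(1)}) \right) \right| > 0 \right] \\
&\le P\left[ \max_i |U_i| \ge \rho \right] \\
&\le \frac{1}{\rho} \left\{ E\left[ \max_i |Z_i| \right] + \frac{1}{2} \lambda_1 \left\| \left( X_{(1)}^T X_{(1)} + \lambda_2 I \right)^{-1} b \right\|_\infty \right\} \\
&\le \frac{1}{\rho} \left\{ 3 \sqrt{\frac{\sigma^2 \log q}{n C_{\min}}} + \frac{\lambda_1}{2n} \left\| \left( C_{11} + \frac{\lambda_2}{n} I \right)^{-1} b \right\|_\infty \right\}.
\end{aligned}$$

So, condition (b) in Theorem 1 guarantees that $P(\mathcal{M}(U)) \to 1$. □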